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Abstract 



This article introduces trimmed estimators for the mean and covariance functional 
of data in general Hilbert spaces. The estimators are based on a data depth measure 
that can be computed on any Hilbert space, because it is defined only in terms of 
the interdistances between data points. We show that the estimators can attain 
the maximum breakdown point by properly choosing the tuning parameters, and 
that they possess better outlier resistance properties than alternative estimators, as 
shown by a comparative Monte Carlo study. The data depth measure we introduce 
can also be used for visual screening of the data and is a practical tool to detect 
clusters and isolated outliers, as shown by three real-data applications. 

Key Words: Abstract Inference; Data depth; Outlier detection; Robust statistics; 
Stochastic processes. 



1 Introduction 



Many statistical problems today involve the analysis of data that do not fit the clas- 
sical univariate or multivariate frameworks. For example, the data sometimes is 
a sample of univariate curves, like growth curves, EEG curves, spectrograms, and 
periodograms. Mathematically we can think of these univariate curves as realiza- 
tions of a stochastic process X in L 2 (R). The statistical analysis of this type of data 
has received much attention in recent years (see e.g. Ramsay and Silverman 2002, 
2005, and references therein). 

The space L 2 {R), being a separable Hilbert space, has many things in common 
with W. Therefore, the statistical concepts are not substantially different. For 
example, the covariance operator and its eigenfunctions in L 2 (IR) are a straight- 
forward extension of the covariance matrix and its eigenvectors in W, so principal 
component analysis is carried out in a virtually identical way. The crucial difference 
between W and L 2 (R) is the dimensionality. Concepts that are meaningful in a fi- 
nite dimensional space, like volume or Mahalanobis distance, cannot be extended 
in a useful way to infinite dimensional spaces. This complicates the development of 
outlier-detection tools and the definition of equivariant robust estimators, because 
most multivariate robust estimators are either based on the Mahalanobis distance 
(Maronna 1976; Davies 1987) or in notions of volume and data depth based on 
simplices (Oja 1983; Liu 1990). Even in finite dimensional spaces, if the sample 
size is smaller than the dimension (the "n < p problem", so common in Chemomet- 
rics and Microarray Data Analysis) traditional robust covariance estimators cannot 
be computed and alternatives must be sought (Filzmoser et al. 2008, Filzmoser et 
al. 2009, Hsieh and Hung 2009). 

In this article we will introduce a measure of "outlyingness" (or data depth) 
that does not depend on notions of volume or Mahalanobis distance, and therefore 
can be used in general Hilbert spaces for the detection of atypical observations and 
the construction of robust estimators. We will focus on infinite dimensional spaces, 
but the estimators can also be used in finite dimensional spaces and are particularly 
useful when the "n < p problem" is present. 

But the first question that arises is whether one needs, in practice, sample 
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Figure 1: Handwritten Digits Example. Eight instances of the number "five". 



spaces that are even more abstract than L 2 (R). The answer is yes. Consider, 
as a motivating example, the excitation-emission matrices used in Chemometrics 
(Mortensen and Bro 2006). Here the data is of the form Xi(s,t), where X t is the 
intensity of light emitted at wavelength t when the ith chemical compound is ex- 
cited at wavelength s. The resulting phosphorescence surfaces constitute a sample 
in L 2 (R 2 ). We cannot show this sample in a single plot, but we have created a 
movie that shows the 338 sample surfaces in quick succession (available as sup- 
plementary material) . This movie helps visualize the salient modes of variability 
and also the fact that there are some outiers in the sample. More about this data 
will be said in Section [6l At this point, however, it is clear that an outlier-screening 
procedure cannot consist of simply watching a tiresome movie that shows every 
sample surface; a more sophisticated and practical method is needed. 

As a second example, consider a handwritten digit recognition problem. The 
horizontal trajectory of the pen tip can be seen as a curve (x(t),y(t)) in R 2 , where x 
is the horizontal position, y is the vertical position, and t is time. The sample space 
here is L 2 (R) x L 2 (R) . A few digits are shown in Figure [Q they are all drawings of 
the digit "five", although some of them look like a "six". The reason is that there 
are two ways of drawing the number "five", as explained in Section [6l This creates 
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two different clusters in the sample of "fives", which are very difficult to determine 
without an outlier-detection tool (the sample contains 1055 handwritten "fives", 
so visual screening is out of the question). 

The problem of robust estimation in infinite-dimensional spaces has been ad- 
dressed by other authors, such as Locantore et al. (1999), Fraiman and Muniz 
(2001), Cuevas et al. (2007), Gervini (2008), Lopez-Pintado and Romo (2009), 
and Cuevas and Fraiman (2009). The approach varies. Locantore et al. (1999) 
and Gervini (2008) define estimators based on the spatial median and the covari- 
ance of normalized observations; the main drawback of these estimators is the 
low breakdown point of the principal components. The other papers are mainly 
concerned with data depth definitions and robust estimation is mostly seen as a 
by-product. The robustness properties of the estimators are therefore not well 
studied, and some of these data depth measures (specifically, the ones in Fraiman 
and Muniz 2001 and Lopez-Pintado and Romo 2009) are based on the ranks of the 
Xi(i)s and cannot be easily extended beyond L 2 (R). Strictly speaking, they do not 
even apply to the unrestricted space L 2 (R), since they assume that the X t s can be 
evaluated at each t. Formally, this problem can be circumvented by restricting the 
definition to the space of continuous square-integrable functions, but the real prac- 
tical problem is to obtain accurate estimates of X^t) for each t, especially when 
the XiS are sparsely observed. 

In this paper we propose an outlyingness measure that is entirely based on 
the interdistances between pairs of observations. Since any Hilbert space has a 
canonical norm and therefore a canonical metric associated with its inner product, 
this outlyingness measure can be defined on any Hilbert space. In fact, this measure 
could be defined on any metric space, but notions of directions of variability and 
principal components only make sense in spaces with inner products, so we will 
focus on Hilbert spaces in this paper. 

The idea of defining robust estimators based on interdistances is not entirely 
new. For example, Wang and Raftery (2002) used a nearest-neighbor cleaning ap- 
proach to defined robust covariance estimators. But this approach, as presented 
by the authors, cannot be extended to infinite-dimensional spaces (their weight- 
ing scheme depends on distributional assumptions that are only valid in finite- 
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dimensional spaces) . In contrast, the estimators we define are trimmed estimators 
based on the ranks of the outlyingness measures, so no distributional assumptions 
are needed. 

These estimators are very easy to compute and show very good robustness prop- 
erties in our simulations. In fact, we derive their breakdown point and show that 
they can attain the maximum 0.50 by appropriately choosing the tuning parame- 
ters. The definitions of the outlyingness measure and the trimmed estimators are 
presented in Section [2j Their finite-sample properties are studied in Section [3} 
and their population properties in Section @J The comparative Monte Carlo study 
is reported in Section [5l and the two real-data applications mentioned before are 
presented in Section [6l The Appendix contains some proofs. 

2 Trimmed mean and covariance estimators: defini- 
tion 

Given a sample X 1 , . . . , X n in a Hilbert space H, consider the interdistances d^ = 
|| Xi — Xj\\. An observation Xi is an outlier if it is far from most of the other obser- 
vations (not necessarily from all of them, since outliers sometimes form clusters) . 
Given a G [0, .50] define the a-radius as the radius of the smallest ball cen- 
tered at Xi that contains 100a% of the observations. This is easy to compute: for 
each i, the interdistances {dij} are sorted in the index j, obtaining the sequence 
e?t,(i), • • • ) di,{n)\ then ri = di t (j an -\), where \x] denotes the integer closest to x from 
above. 

The idea behind this definition is that the r^s will be small where the data 
is tight (usually near the center of the distribution) and large where the data is 
sparse, including regions with outliers. However, if there is a tight cluster of n* 
outliers and \an\ < n*, the r-jS for the outliers will be small, perhaps smaller than 
for the "good" data. For this reason, it is important to chose a large enough that at 
least one good observation will be captured by r» if Xi is an outlier. In general, only 
a = .50 will guarantee this (see Proposition [2] in Section [3]). In practice, however, 
it is instructive to draw box-plots and histograms of the r^s for different values of 
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a; the outliers tend to emerge clearly and consistently as a increases. 

The next step is to cut off or downweight the observations with largest radii. 
Given a trimming proportion (3 e [0, .5], define w(Xi) = I{r; < r(|- (1 „^) n ^}, where 
!{•} is the indicator function, or more generally w(Xi) = g (rank (r^/n), where 
rank(rj) is the rank of r, among the radii and g : [0, 1] — > R + is a bounded, non- 
negative and non-increasing function such that g(t) > for t < 1 - /3 and g(t) =0 
for t > 1 — f3. We will mostly use hard-rejection weights in this paper, but if a 
smoother downweighting scheme is preferred, one can choose a f3 1 > (3 and take, 
for instance, 

< r < a, 

9(r)={ (r-b) ^ + ™<£<?»» ], a<r<b, (1) 




r > b, 



with a = 1 — Pi and b = 1 — f3, which is a differentiable function that progres- 
sively downweights the largest 100/^% radii and cuts off the largest 100/3% radii 
completely (one can take, for instance, /3 X = .50 and a less drastic (3 = .20). 
The trimmed mean estimator is then defined as 

1 n 

and the trimmed covariance functional as 

(the tensor product / <%> g is defined as (/ ® = (/, /i)^ for h e H, and (/ <g> 
g)(hi, h 2 ) = (f, hi)(g, h 2 ) for h u h 2 G K). 

As with univariate trimmed estimators, the trimming proportion /3 determines 
the amount of outliers that fi and C can tolerate, as well as their efficiency (Maronna 
et al. 2006, chap. 2; Van der Vaart 1998, chap. 22). The robustness increases with 
(3 but the efficiency decreases, so in practice we recommend to choose (3 in a data- 
driven way. A histogram of the radii usually gives a good idea of the proportion of 
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outliers in the sample, and we recommend to choose f3 large enough to cut off all 
the outliers, but not larger. 

An important computational advantage of these estimators is that the weights 
w(Xi) depend on the data only through the interdistances dij, which can be com- 
puted entirely from the inner products (Xi, Xj). These inner products can be typ- 
ically well approximated by numerical integrals based on the raw observations of 
the XjS (see Gervini 2008, theorem 1); a finer reconstruction of the XjS is usually 
not required. This is an important practical advantage because the X^s are often 
sampled on sparse grids, and reconstructing them on a finer grid may be problem- 
atic (usually involving smoothing) . 

The principal components of <£ can also be computed using the inner prod- 
ucts (Xi,Xj) only. The principal components of <£ are the (non-null) functions 
<p e % such that £0 = A0 for some A e C, in which case A is called an eigenvalue 
and an eigenfunction. Since £ is a compact self-adjoint operator, the eigenval- 
ues are real and non-negative, and the number of strictly positive eigenvalues is 
countable when H is separable; in addition, each eigenvalue has a finite multi- 
plicity, and eigenfunctions corresponding to different eigenvalues are orthogonal 
(Gohberg et al. 2003, chap. IV). Then, without loss of generality, we can as- 
sume that the principal components {<j) k } are orthonormal and ordered so that 
Ai > A 2 > • • • > 0. As shown in Gervini (2008) or Jolliffe (2002, section 3.5), 
the principal components are linear combinations of the observations. Specifically, 
if «;< = w{X i )IYTi=MX i ), then fc = £r=i( c w/^ /2 )«£ /2 (*i - p) and A fc = l k , 
where c k is the kth unit-norm eigenvector of the matrix G e R nxn with elements 
Gij = {w] /2 (Xi - p) , w l J 2 (Xj —p)), and l k is the corresponding kth eigenvalue. Note 
that, after some algebra, the G^s can be expressed entirely in terms of the inner 
products (Xi,Xj) and the WiS. 
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3 Finite sample properties: equivariance and break- 
down point 

The reason our trimming scheme is based on interdistances is that the resulting 
weights w(Xi) are invariant under translation, re-scaling and unitary transforma- 
tions. As a consequence, the estimators ([2]) and O) satisfy the natural equivariance 
properties of location and scatter estimators. This is shown in Proposition Q] below. 

Recall that a unitary operator (the equivalent of an orthogonal matrix in MP) 
is a il : % — > H such that ||il/|| = ||/|| for every / e H, or equivalently such 
that it*il = ilU* = 3, where J is the identity operator and il* is the adjoint of il 
(for an operator 23 : "Hi — > H 2 > the adjoint is the unique operator 23* : % 2 — > Hi 
that satisfies (/, 98g) = (23*/, g) for every g e Hi and / e T-L^', this would be 
the equivalent of the transpose in MP). A unitary operator of particular interest 
is the "change of basis" operator il = ip k <g> <f> k , where {4> k } and {4> k } are two 
orthonormal systems in H. 

Proposition 1 Let X\, . . . , X n be a sample in H, a ^ a scalar, b a point in H, and 
il a unitary operator. Let Xi = ailXj + b, and denote by d^, f i} Ji, (£, X k and <fi k 
the interdistances, radii and estimators computed with the transformed data. Then 

d~ij = \a\dij, fi = \a\ r i} w(Xi) = w(Xi), fi = ail/t + b, <£ = a 2 ilCil*, = a 2 \t, and 
<l>k = U0 k . 

The proof is given in the Appendix. The key property here is that the radii 
are invariant under data rotations and shifts, and equivariant under scale changes. 
Thus, the ranks of the radii are invariant under the three transformations, as would 
be expected of an "outlyingness" measure: if an observation is deemed an outlier 
for certain data configuration, it should still be seen as an outlier if the dataset is 
simply shifted, rotated or rescaled. 

Let us now turn to the outlier- resistance properties of the trimmed estimators. 
We prove below that fi and € can tolerate, essentially, the smallest between 100a% 
and 100/3% of outlier contamination. We recommend using a large value of a, like 
.50 (a cannot exceed .50 by hypothesis in Proposition [2]), but adjusting /3 according 
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to the expected proportion of outliers. In that case the breakdown point of the 
estimators will be essentially (3. 

Formally, the finite-sample breakdown point of jl, denoted by e* (/t), is defined 
as follows (Donoho and Huber 1983): given a sample X = {X x , . . . ,X n }, let X 
be a contaminated sample obtained by replacing k points of X by arbitrary values; 
then e* n {p) = k*/n, where k* is the smallest value of k for which there is a sequence 
|^(m)j Q £ contaminated samples such that — > oo. The asymptotic break- 
down point £*{fi) is defined as the limit of e*(/t) when n — > oo, provided the limit 
exists. The definitions of e* n {£) and e*(<£) are analogous. 

Proposition 2 Suppose w(Xi) = g(rank(rj)/n), with g satisfying the conditions men- 
tioned in Section^ If a < .50, \an] > 3 and /3 < .50, then e* n (fi) = e*(£) = 
min([an], [/3n\ + 2)/n and e*(p) = £*(<£) = min(a,/3). 

The proof is given in the Appendix. In a few words, the proof shows that when 
the contaminating proportion is less than a, the \an\ -th observation closest to an 
outlier is necessarily a non-outlier (regardless of the configuration of the data), 
whereas the \an\ -th observation closest to a non-outlier is another non-outlier if 
the outliers are far enough from the bulk of the data. As a consequence, the radii 
of the outliers tend to be larger than the radii of the non-outliers, and trimming 
the largest radii will cut off the extreme outliers, as expected. This argument fails 
if there is a cluster of outliers with more than \an\ points, hence the importance 
of choosing a large a. 

4 Functional forms and population parameters 

Consider the family of probability distributions on %, V, which contains the em- 
pirical probability P n of the random sample Xx,...,X n as well as the population 
probability P from which the X, L s are drawn. To study the consistency of the esti- 
mators it is instructive to derive their functional forms, that is, to find functionals 
H : V U and £ : V U such that p, = fi(P n ) and £ = €(P n ), because under 
appropriate regularity conditions p,{P n ) — > fi(P) and €(P n ) — > €(P) in probability 
as n — > oo (Fernholz 1983; Van der Vaart 1998, chap. 20). In this paper we will 
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not be concerned with the precise conditions under which the estimators are con- 
sistent; instead, we will derive explicit expressions of /u(P) and £(P) and see how 
they relate to the population mean and covariance of P, if they exist. 

Given a probability P on H and a stochastic process X with distribution P, 
define F P (t;v) = P(\\X — v\\ < t) for each v G H. The radius of the smallest 
ball centered at v that encompasses probability a is r P (v) = F F> 1 (a;v), where 
Fp 1 (a;v) = min{t : F P (t;v) > a}, the usual quantile function. Then r P (X) is the 
a-radius around X, and if G P (t) := P{r P (X) < i], the weight function w P (v) has 
the form w P (v) = g[G P {r P (v)}], with g as in Section[2l Then the functional forms 
of © and © are 

^ ) E P {w P (X)} 

and 

won = Ep[w P (X){X - MP)} {X - MP)}] 

The eigenvalues and eigenfunctions of £(P) will be denoted by A fc (P) and 4> k {P), 
respectively. 

The first question that arises is whether n(P) and £(P) are well defined for any 
P G V. The next proposition shows that this is indeed the case, because r P (v) 
bounds \\v\\ for any P and then all weighted moments of \\X\\ are finite even if \\X\\ 
itself does not have finite moments of any order. 

Proposition 3 For any a > and any P <eV, there is a constant K ajP > such that 
\\v\\ < r P (v) + K^ P for all v G H. Therefore, if > 0, E P {w P (X)\\X\\ k } < oo for 
any k > 0. 

Another important property of r P (v) is that it really is a measure of outlying- 
ness, or depth, in the sense that r P (v) is smaller in regions of H where P is dense 
than in regions where P is sparse. 

Proposition 4 Ifv and w are two points in H such that P(Bg(v)) > P(Bg(w)) for all 
5 > (where B s (v) denotes the hall with center v and radius 5), then r P (v) < r P (w). 

The equivariance of fi and £ (Proposition []]) carries over to yu(P) and £(P). 
That is, if P is the probability distribution of X and P denotes the probability 
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distribution of aitX + b, with il a unitary operator, a^O, and b e "H, then r p (i> ) = 
|a| r P (u) for all v e H, fi(P) = oU/x(P) + 6, and £(P) = a 2 HC(P)H*. The proof is 
given in the Appendix. 

The next property shows that if the distribution of X has a point of symmetry 
/i , then fi(P) = /i . In that case E(X), if finite, is also equal to /i , so the population 
mean and the trimmed mean coincide. 

Proposition 5 If X has a symmetric distribution about n Q (i.e. X — jj, and /i — X 
have the same distribution), then n{P) = /v 

Now suppose that X admits a decomposition of the form 



with probability 1, where {Z k } C R are random variables and fi e H, {A fc} C R, 
and {<f) Qk } C H are parameters such that the <f) ok s are orthonormal and the A fcS are 
positive, non-increasing, and J2 k X ok < oo (the sequence might be finite or infinite, 
so we omit the limits of summation) . This decomposition holds, for instance, if X 
has finite second moments: in that case /x = E(X), and {X k} and {4> ok } are the 
eigenvalues and eigenfunctions of the co variance operator E P { (X — /^ ) <g) (X — /i ) } 
(this is a consequence of the spectral theorem; see Gohberg et al. 2003, chap. IV). 
In this case © is known as the Karhunen-Loeve decomposition (Ash and Gard- 

1 /2 

ner 1975, chap. 1.4), and Z k = (X — n ,(fi ok )/\ ok , which are uncorrected with 
E(Z k ) = and V(Z k ) = 1. 

But expansion © is valid under more general conditions. For example, if 
the Z k s are independent, Kolmogorov's Three Series Theorem (Gikhman and Sko- 
rokhod 2004, p. 384) implies that J2k Ki Zk<fio k converges almost surely in % if 
and only if J2 k P(X ok ZQ k > c) < oo for every c > 0. The latter is satisfied if the A fcS 
go to zero fast enough; it is not necessary that the Z k s have finite moments. For 
instance, if each Z k has a Cauchy distribution, then 




(4) 



A: 
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^ ^ Icy 

which converges for any c > as long as J2k Ki < 00 • 

Note that under © the interdistances dy = \\Xi — Xj\\ can be written as c% = 
{J2k ^ok(Zki — Zkj) 2 } 1 / 2 . Therefore their distribution, and consequently that of the 
radii, will only depend on the Z k s and the A fcS; the intrinsic elements of H in 
expansion (0]), namely and the 4> ok s, do not play any role in the distribution of 
the dijS. To put it in different words: the particular space % where the data lives 
does not determine the important properties of the estimators; the eigenvalues and 
the component scores do. 

Proposition 6 If expansion holds with independent and symmetrically distributed 
Z k s, then 

k 

for a positive sequence {X ok } that does not depend on /x or {4> ok }, but only on the 

1 /2 

distribution of {X ' k Z k }. In addition, if the Z k s are identically distributed, then X 0j = 
X ok implies X 0j = X ok . 

Proposition [6] indicates that the set of trimmed principal components {(j) k (P)} 
coincides with the set {<f) ok }, although we do not know whether the sequence {X ok } 
is non-increasing, so it is not necessarily true that 4> k (P) = 4> ok for each k. It is 
clear, however, that the dimensionality of the model is not changed, because 

~ x E P {w P {X)\{X-^^ k )\ 2 } 
E P {w P (X)} 

and then {Aofe} has the same number of elements as {X Qk } if {A fc} is a finite se- 
quence. Moreover, the multiplicity of the eigenvalues associated with each 4> ok is 
preserved if the Z k s are identically distributed. 

5 Simulations 

We ran some simulations to study the robustness and efficiency of the trimmed 
estimators, especially when compared to the spatial median and spherical principal 
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components (Gervini, 2008) . One of the motivations for the present work was to 
improve on the low breakdown point of the spherical principal components, which 
we have accomplished, as shown below. 



We simulated data from the model X(t) = J2T=i ^fcV%fc</>fc(*) with Z k s indepen- 
dent with N(0, 1) distribution, and 4> k (t) = y/2sm(irkt) fort e [0, 1]. Two sequences 
of eigenvalues were considered: a slow-decaying sequence X k = l/{k(k + 1)} 
(Model 1) and a fast-decaying sequence X k = l/2 k (Model 2); note that J2T=i ^ = 
1 for both models. For practical purposes, Model 2 behaves like a finite-dimensional 
model, since the first five terms account for 97% of the total variability; Model 1, 
on the other hand, needs 31 terms to accumulate the same proportion of the vari- 
ability, so it can be seen as a truly infinite-dimensional model. For the actual data 
generation we truncated Model 1 at the 1000th term and Model 2 at the 10th term, 
which represent 99.9% of the total variability in both cases. The curves were sam- 
pled on an equally spaced grid of 50 points. The sample size was n = 100 for all 
the scenarios described below, and each case was replicated 500 times. 

First we studied the behavior of the sample mean, the median, trimmed means 
with hard-rejection weights, and trimmed means with soft-rejection weights (de- 
fined through the function CD). For the hard-rejection trimmed means we consid- 
ered the four possible combinations of the parameters a = .20, a = .50, /3 = .20, 
and = .50. For the soft- rejection trimmed mean we also considered a = .20 and 
a = .50, but only (3 = .20 (the parameter (i x was .50). In addition to the non- 
contaminated data, we generated outliers by adding 30! (i) to the first ne sample 
curves. Four contaminating proportions e were considered: .10, .20, .30, and .40. 
The simulated root mean squared errors {E(||/i|| 2 )} 1//2 are reported in Tabled! In 
this table, and also in Tabled "Hard (.20,. 50)" denotes the hard-rejection trimmed 
estimator with a = .20 and (3 = .50, and so on. 

We see that the trimmed means compare favorably with the spatial median 
in terms of robustness. Although the median is more efficient for clean data, it 
becomes more biased at higher levels of contamination. Regarding the parameter 
a, we see that there are no advantages in taking a = .20 instead of a = .50. As for 
f3, the hard-rejection trimmed mean with /3 = .20 is slightly more efficient than the 
hard-rejection trimmed mean with (3 = .50 for clean data, but much less robust at 
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Table 1: Simulation Results. Root mean squared errors of location estimators for 
different models and contamination proportions. Estimators are: sample mean, 
median, and trimmed means with hard- and soft- rejection weights. 



high levels of contamination; so the choice (3 = .50 is clear in this case. The soft- 
rejection (.50, .20) trimmed mean is slightly better than the hard-rejection (.50, .50) 
trimmed mean for small es, but the latter is more robust for large es. So the choice 
between soft and hard trimming is not clear-cut from Table [0 

We also compared the behavior of estimators of the first principal component. 
We computed sample principal components, spherical principal components, and 
trimmed principal components with the same weights and parameters a and f3 as 
above. Now the outliers were generated in a different way: we added 34> 2 (t) to 
the first ne/2 sample curves and subtracted the same quantity to the next ne/2 
sample curves. This type of contamination inflates the variability in the direction 
of the second component but does not affect the mean. Therefore, a non-robust 
estimator of the first principal component will be closer to 2 than to l5 and the 
error ||^ — <f> x \\ will be approximately \[2 (this would be considered breakdown 
for the first principal component). We considered the same four contamination 
proportions e as before. The simulated root mean squared errors {i^dl^— 1 || 2 )} 1 / 2 
are reported in Table [21 
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Model 1 



Model 2 



e e 



Estimator 





.10 


.20 


.30 


.40 





.10 


.20 


.30 


.40 


Sample p.c. 


.13 


1.34 


1.38 


1.39 


1.39 


.17 


1.35 


1.38 


1.39 


1.39 


Dpnencai p.c 


1 c 
. ID 


.o 1 


i no 


1 "39 
l.jZ 


1 "37 
l.O / 


91 


.01 


1 9f> 
l.ZU 


1 


J..O/ 


Hard(.20,.20) 


.22 


.20 


.15 


1.32 


1.38 


.30 


.30 


.26 


1.32 


1.38 


Hard(.50,.20) 


.22 


.21 


.16 


1.32 


1.38 


.31 


.32 


.26 


1.32 


1.38 


Hard(.20,.50) 


.35 


.36 


.33 


.31 


.32 


.46 


.47 


.47 


.46 


.46 


Hard(.50,.50) 


.35 


.37 


.36 


.34 


.35 


.43 


.45 


.49 


.53 


.68 


Soft(.20,.20) 


.27 


.27 


.25 


.20 


1.26 


.37 


.39 


.37 


.33 


1.28 


Soft(.50,.20) 


.27 


.28 


.25 


.20 


.93 


.35 


.37 


.39 


.37 


1.13 



Table 2: Simulation Results. Root mean squared errors of first principal component 
estimators for different models and contamination proportions. Estimators are: 
sample p.c, spherical p.c, and trimmed p.c. with hard-rejection and soft-rejection 
weights. 



The conclusions from Table [2] largely agree with those from Table [TJ except that 
the soft-rejection (.50, .20) estimator is now clearly preferable to the hard-rejection 
(.50, .50) estimator; only for the extreme case e = .40 is hard-rejection better. A 
noticeable fact in Table [2] is the low breakdown point of the spherical principal 
component: for all practical purposes, it breaks down at e = .20, while the soft- 
rejection (.50, .20) estimator does not seriously deteriorate until e = .40, and the 
hard-rejection (.50, .50) estimator has a low error even then. Although the trimmed 
principal components are considerably less efficient than the spherical principal 
components for non-contaminated data, we think that their superior robustness 
compensates for this. 
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Figure 2: Handwritten Digits Example, (a) Sample mean, (b) 41% trimmed mean, 
and (c) mean of the trimmed observations. 

6 Applications 

6. 1 Handwritten Digits 

This data was collected from 44 persons who wrote 250 samples of each digit 
0, 1, . . . , 9. The dataset is available at the Machine Learning Repository of the Uni- 
versity of California at Irvine, h ttp://archive.ics.uci.edu/ml71 , and has been previously 
analyzed by Izenman (2008, chap. 7). For each handwritten digit the trajectory of 
the pen's tip, (x(t),y(t)), was recorded at eight equally spaced time points. The 
data was rescaled and rotated so that x and y range between and 100, and 
t between and 1. Therefore, the sample space is L 2 ([0, 1]) x L 2 ([0, 1]), which 
we endow with the canonical inner product ((xx,yi), (2^2, 2/2)) = J x 1 {t)x 2 {t)dt + 
Jq 1 y 1 (t)y 2 (t)dt. We will only analyze the data corresponding to the digit "5", of 
which there were n = 1055 replications available. 

At first glance, many of the handwritten digits in Figure [T] seem to correspond 
to the number "6" rather than "5". The reason is that (x(t),y(t)) does not track the 
vertical movement of the pen, and the way most people draw the number "5" is in 
two strokes: they begin at the upper left corner and move downwards, then raise 
the pen and draw the top dash. But there is another way to draw the number "5": 
in a single stroke, beginning at the upper right corner (like the letter "S"). The 
sample of 1055 handwritten "fives" is a mixture of both, and one soon realizes that 
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Figure 3: Handwritten Digits Example. Histogram of the radii with a = 



.50. 



the classical principal component analysis does not yield much useful information. 
The sample mean, shown in Figure [2](a), does not resemble a "5" or any other 
recognizable digit. When the leading 20 principal components are computed, one 
finds that the first component alone explains 78% of the variability. The scores of 
this component show a clear bimodal distribution, corresponding to the two types 
of "fives". Then the first principal component can be used for discrimination, but 
does not really provide useful information about subtler forms of variability in the 
written digits. 

A better way to spot the two clusters is to use the radii. We computed radii 
for different values of a, and noticed that their distribution becomes increasingly 
bimodal as a increases. The histogram for a = .50 is shown in Figure [3l There 
are two neatly distinguishable groups: 627 observations with < 60, and 428 
observations with r { > 60. This does not necessarily imply that there are two 
clusters in the data, because large values of r { that are close to each other do not 
necessarily correspond to curves that are close to each other in H. But the large 
number of extreme radii (40.5% of the data) tends to point out to a systematic 
departure from a homogeneous population rather than isolated outliers. This is 
confirmed by a plot of the trimmed mean with (3 = .41 (Figure [2Kb)), together 
with the mean of the observations that were cut off (Figure [2](c)). Clearly, Figure 
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Figure 4: Handwritten Digits Example. Effect of the trimmed principal components 
[(a) first, (b) second] on the trimmed mean. Solid line is the trimmed mean; 
dashed line is the trimmed mean plus 4 times the p.c; dotted line is the trimmed 
mean minus 4 times the p.c. 

EKb) (and then the observations with < 60) correspond to the subclass of "fives" 
drawn in two strokes, whereas Figure EKc) (and then the observations with > 60) 
correspond to the more unusual "fives" drawn in one stroke. 

The trimmed principal components do provide useful information about the 
main sources of variability in the bigger subclass. The first two components account 
for 56% and 14% of the total variability, respectively. The easiest way to interpret 
the components is to visualize their effects on the mean; so we see in Figure [4] that 
the first principal component mainly explains variation in the inclination of the 
"belly" of the digit, while the second principal component mostly explains variation 
in the slant of the vertical dash (which also has an effect on the position of the 
"belly"). 

The existence of more clusters can be investigated by recomputing the interdis- 
tances and the radii for each subclass independently. We did that and found that 
the radii had clear unimodal distributions, so there does not seem to be any other 
subclass of "fives" in addition to the two found in this section. 
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6.2 Excitation-Emission Matrices 



In an enzyme cultivation process, real-time quality control is important because it 
allows for prompt adjustments in the process. To this end, samples of the cultiva- 
tion broth are taken at regular intervals and the enzyme activity is measured. A 
traditional chemical analysis of the samples can determine enzyme activity very ac- 
curately, but it may take hours or days to perform. An alternative method employs 
multi-channel fluorescence sensors, which yield immediate results in the form of 
excitation-emission matrices (EEMs). But enzyme activity can be determined only 
indirectly from these matrices, using principal component analysis and partial least 
squares. The process is described in detail in Mortensen and Bro (2006); here 
we are going to use the data analyzed by these authors, which is available at 
http:/ /www. models. I ife.ku.dk/research/data/. 

Mortensen and Bro (2006) mention using a calibration sample of 283 EEMs, a 
test sample of 53 EEMs and an additional 15 EEMs of pure enzyme in their analy- 
sis. However, the calibration sample available online contains 338 EEMs (the test 
sample is also available but we do not use it here) . It is not clear if Mortensen and 
Bro (2006) did a pre-screening of the calibration set and discarded some observa- 
tions, but it is clear that the sample of 338 EEMs contains outliers. This is easy to 
see by plotting all the EEMs in quick succession; a movie showing this is available 
as supplementary material. 

We carried out the principal component analysis on the logarithm of the light 
intensities, rather than directly on the light intensities. This ameliorates the effect 
of the outliers to some extent, but not completely. A histogram of the radii (Figure 
[5]) shows that 15 observations are clear outliers (we suspect these are the pure- 
enzyme EEMs alluded to in the paper, although the website we downloaded the 
data from does not say so explicitly). There does not seem to be any other outliers, 
so we computed the hard-rejection trimmed mean and the leading 20 principal 
components with (3 = .05. The first component accounts for 59% of the total 
variability, and the second component for 20%. They are shown in Figure [6l 

The sample mean and principal components are also shown in Figure [6l The 
first component explains 88% of the variability, and the second component only 
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Figure 5: Excitation-Emission Matrices Example. Histogram of the radii with a = 
.50. 



Figure 6: Excitation-Emission Matrices Example, (a) Trimmed mean, (b) first 
trimmed principal component, (c) second trimmed principal component, (d) sam- 
ple mean, (e) first sample principal component, and (f) second sample principal 
component. Trimmed estimators were computed with 5% trimming. 
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Figure 7: Excitation-Emission Matrices Example. Scatter-plots of standardized 
component scores versus enzyme activity measure, for (a) the second trimmed 
principal component and (b) the second sample principal component. 

5%. Note that while the sample mean and the trimmed mean do not differ very 
much, the first sample component is completely different from the first trimmed 
component. The reason is clear: as in the previous example, the first sample prin- 
cipal component only explains how the outliers vary from the non-outliers; it may 
be useful for outlier detection, but it is not associated with any genuine source 
of variability. Moreover, it downplays the importance of the second component, 
which is assigned only 5% of the total variability. 

In fact, it is the second component the one primarily associated with enzyme 
activity. Figure [7] shows scatter plots of the standardized component scores versus 
an enzyme activity measure (the 15 outliers were removed). The association be- 
tween component scores and enzyme activity, although not entirely linear, is clearly 
stronger for the trimmed principal component scores (the correlation coefficient is 
.69) than for the sample principal component scores (the correlation coefficient is 
.57). 

These examples show that the principal components can be seriously affected 
by outliers in the data. These outliers can sometimes be detected by other means, 
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especially when they form a single cluster (for instance, by a classical discrimina- 
tion method or by the first sample principal component, as in these examples). But 
isolated outliers are harder to detect because they have a masking effect on the 
estimators: they distort the principal components without appearing as unusual 
observations in a plot of the component scores or the residuals. Such cases can 
only be properly handled with robust estimators. A third example that illustrates 
this problem was not included for reasons of space but is available as supplemen- 
tary material. 

Appendix 

Proof of Proposition QQ 

If Xi = aSXXi + b with il unitary, then c% = || (aiXXi + b)- (attXj + b)\\ = \a\ - 
Xj)\\ = \a\dij. This implies that fj = |a|r* and then w(Xi) = w(Xi), because 
rank(fj) = rank(rj). Therefore 



Ei=i w ( x i) 



ailjj + b, 



and for v e % we have 




■n 



^^(X,)(aii(X 4 - p),v){aaiXi - ft)} 




a 2 U{£(iTv)}, 



so € = a 2 tt<£ii as claimed. 
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Proof of Proposition [2] 

We will first show that £*(/*) > min( [cm] , [/3n\ +2)/n, and then proceed to show 
that the equality holds by exhibiting a particular sequence of contaminations that 
makes \\p\\ go to infinity. 

Suppose, then, that {X^} m >i is a sequence of contaminated samples obtained 
from X by replacing k observations, and such that ||/r m ^|| — > oo as m — > 00. Since 
|| . (m) || < y™ =1 w{x\ m) ) ||X. (m) ||/ J2ti w (xt } )> ° ne can choose a sequence of points 
r^Mj suc j 1 ||x} m) || — >■ 00 when m — >• oo and u;(X- m) ) > for all m. 
The latter implies that the rank of the corresponding radii is strictly less than 
(1 — p)n, because g{t) = for any t > 1 — ft; therefore f|™ ) < rrJ 1 }^,,. 

Since ff m ' ) is defined as the distance between X) and its Tan] -th closest point 

l m Lm 1 1 J. 

in A^" 1 ), and X^ has only A; outliers, if k < [an] there has to be at least one point 
Xj of the original sample X such that ||X-"^ —X* II < r- m) . Since ||X-"^|| — > 00, it 

jra »_> ± 11 t m jm 11 — i m 11 t m 11 - 1 

follows that ||lj m) - X, II — )■ 00 when m — >■ 00 (the X. s are bounded) , so rf m ' ) — > 

00 as well. Butf^ < fjm.^-m so a total of at least l+(n-[(l-/3)n]+l) = [/3nJ+2 
radii f ■ go to infinity when m — > 00. 

Now, if A; < [an] then the number of non-outliers in X^ is n — k > [an] 
(because a < .50), so for each non-outlier X { in X^ there are at least [an] non- 
outliers Xj such that = ||X; — Xj-|| remains bounded regardless of m; therefore 
the corresponding f^s cannot go to infinity. This means that the [/3n\ +2 radii that 
go to infinity correspond to observations in X^ that are outliers, so the number 
of outliers k cannot be less that [/3nJ + 2. Then k > [(3n\ + 2 when k < [an] . The 
other possibility is that k > \an\, so k > min([ an], \J3n\ + 2). This proves that 
e* n {p) > min( [an], [f3n\ +2)/n. 

To see that e* n (fi) = mm(\an], [(3n\ + 2)/n, take k = min( [an], [f3n\ + 2) and 
consider the two possibilities: [f3n\ + 2 < [an] or [f3n\ + 2 > \an\. If [f3n\ + 2 < 
[an], take any X £ H with norm one and define the outliers X> m ' = m l X , for 

1 = 1, . . . , k; then the distance between each X- m ^ and any other point in X^ 
(including other outliers) goes to infinity when m — > oo, so f | m) — > oo for i = 
1, . . . , k; since k = [/3n\ + 2 in this case, at least one of the outliers is not cut off 
and then ||/t^|| — > oo as in — > oo. For the other case, [(3n\ + 2 > [an], define 
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the outliers X^ = mX for i = 1, . . . , k; then r\ m ' = for i = l,...,k (because 
k = \an] in this case), but > for the non-outliers (except in the trivial 
situation where all the sample points are identical) . The number of non-outliers is 
n — k = n — \an\ > n — \_/3n\ — 2, so n — k > n — [f3n\ — 1 > [f3n\ — 1 (because 
(3 < .50). Therefore the n — \(l — j3)n\ + 1 = [f3n\ + 1 observations that are cut 
off include at most two outliers, and since \an\ > 3 by hypothesis, there is at least 
one outlier that is not cut off, so ||/i (m ' ) || — > oo as in — > oo. 

For £*(£) the proof is similar, because ||£|| < Y.™=i w ( x i)\\ x i\\ 2 / Y,™=i w ( x i) + 
\\fi\\ and the preceding proof is also valid for YH=i w { x i)\\ x i\\ 2 1 YH=i w ( x i)- 

Proof of Proposition [3] 

By definition, a < P{\\X— v\\ < r P (v)} for any v e H. Since \\X— v\\ > \ \\X\\ — \\v\\ |, 
it follows that a < P{\\v\\ — r P (v) < \\X\\}. But if a > 0, there is a finite K a ^ P such 
that P{||Jf|| > K a)P } < a/2, say. Therefore \\v\\ — r P (v) < K a>P for any v e H. 
Then, if (3 > 0, w P (v) > implies r P (v) < G P 1 (1 — /3), which is finite, so 

E P {w P (X)\\X\\ k } < E P \I{\\X\\<K a , P + G P \l-(3)}\\ x \\ k ] 
< {K a<P + G p 1 (1 - f3)} k <oc 

for any k > 0. 

Proof of Proposition m 

Suppose r P (v) > r P (w). By definition, r P (v) = mm{5 : P(B s (v)) > a}, so 

(«)) < a. ButP(B rp{w) {v)) > P{B rp{w) {w)) by hypothesis, and P{B rp{w) {w)) > 
a by definition of r P (w), a contradiction. Then it must be r P (v ) < r P (w). 
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Proof of the equivariance of n(P) and <£(P) 
Let X ~ P and X = ailX + 6 ~ P. Then 



Fp(t;t>) = P(||aUX + 6-u|| <*) = P(\\X-U*(v-b)/a\\ <t/\a\) 
= F P (t/\a\;U*(v-b)/a), 

so rp{v) = \a\r P (ii*(v — b)/a) for all v EH. Then 

Gp(t) = P{r P (X) <t} = P{r P (aiiX + b) <t} 
= P{\a\r P (X)<t} = G P (t/\a\), 

which implies that 

wp(v) = g[Gp{rp(v)}]=g[G P {rp(ii*(v~b)/a)}] 
= w P (iX*(v-b)/a) 

for all veH. Therefore Ep{w P (X)} = E P {wp(aUX + b)} = E P {w P (X)} and 

Ep{wp(X)X} = E P {wp(aiiX + b)(ailX + b)} = E P {w P (X)(aUX + b)}, 

from which it follows that ii(P) = all^(P) + b. In a similar way one obtains <£(P) = 
a 2 iX<t(P)il*. 

Proof of Proposition [5] 

It follows immediately from the equivariance of n(P). Let X ~ P, X — /i ~ P a and 
-X + //„ ~ P 2 . Then ^(P x ) = //(P) - ^ and /i(P 2 ) = -//(P) + /i . But Pi = P 2 by 
hypothesis, so /i(P) — p = — //(P) + /i , which implies that /x(P) = /x - 

Proof of Proposition [6] 

This result is also a direct consequence of the equivariance of the estimators. Since 
<£(■) is location invariant we will assume, without loss of generality, that /i = 0. 
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Then, since the Z k s have a symmetric distribution about 0, by Property [5] we have 
H(P)=0. Therefore 



C(P) 



Ep{w P {X){X®X)} 
E P {w P (X)} 

= E { \ x)} E E z^z^ ® ofc ) 

P l P V /J j ^ 

= 1 p(x)| E E MM*) *) (<f>0k, X)}(<f> 0j ® 00*) ■ 

j A; 

We will show that E P {wp(X)((f) 0j , X)(4> ok , X)} = for j 7^ fc. Since {<f) ok } is an 
orthonormal system in H, if it is not complete we extend it to an orthonormal 

basis of H, {4>Q k }, and correspondingly we extend the sequences {A fc} and {Z k } 

- 1/2 - - 

by adding zeros, so that we can write X = J2k Kk Zk4>ak with probability one. 
Now, any operator of the form (5 = J2 k a k (cj) ok <g> <f) ok ) with a k = ±1 is unitary and 
self-adjoint, so we know by the proof of equivariance above that if P denotes the 
probability distribution of &X, then wp(v) = wp(&v) for all v G %. But under the 
current assumptions the Z k s are independent and symmetrically distributed around 
(including the artificially added Z k s), so &X and X have the same distribution; 
that is, P = P and then wp(v) = wp(&v) for all v G H. Therefore 

Ep{w P (x)^ 0j ,x)^ ok ,x)} = E P {w P (ex)$ oj ,ex)$ ok ,ex)} 

= E P {wp(x)(e^,x)(&^ k ,x)}. 

If, in particular, we take &j to be the sign-change operator for the jth coordinate 

(a, = —1 and a k = 1 for any k ^ j), we have &j4> 0j = —<fi 0j and &j<fi ok = <p ok for 
any k ^ j, so 

E P {w P (X)(G^ op X)(e^ ok ,X)} = -Ep{w P (X)(^,X)(^ k ,X)}. 
This implies that E P {wp(X)((f) 0j , X)((f) ok ,X}} = 0, as we wanted to prove. Then 

£ ( P ) = E^^ofc®0o fc ) 

k 
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with 

~ E P { W p(X)|(0 ofc ,X)| 2 } 
E P {w P {X)} 

as claimed. 

Now, since F P (t;v) = P(\\X - u|| < t) = P{£ k 0ikZ k - 6 k ) 2 < t}, with 

1/2 

9 k = {<p ok , v), it is clear that rp(v) depends on P only through {\ ' k Z k } and does not 
depend on // or on the (f) ok s. It is also clear that r P (v) depends on v only through 
its Fourier coefficients {9 k }, so the distribution of r P (X), G P (t), depends on P only 

1 /2 

through {\ ' k Z k } (because these are the Fourier coefficients of X). Consequently, 
the weight function w P (v) = g[G P {r P (v)}] depends on P only through {\ l J k Z k }, 
and depends on v only through {6 k }, implying once again that the distribution of 

1 /2 1/2 ~ 

wp(X) depends only on {\ ' k Z k }. Since (<f> 0k ,X) = \ ok Z k , it is clear then that A fc 

1 /2 

depends on P only through {\ ' k Z k } and does not depend on /i or the 4> ok s. 

Finally, suppose that the Z k s are identically distributed in addition to being 
independent, and that A j = Xok- Then the distribution of X remains unchanged if 

1 /2 1 /2 

we switch A ^ Zj with \ ok Z k . More formally, define the "switch operator" & jk = 
Ooj ® O fc + 0ofc ® (f> 0j ) + {3 - Ooj ® O fc + 0ofc ® 0oi)} 3 where ^ is the identity 
operator. Then, if X = J2 k *ik Z k<i>ok> we have & jk x = K^Zjhk + K^Z^oj + 
k Ki Zi4> i, so X and &j k X have the same distribution. Moreover, since & jk 
is unitary and self-adjoint, w P (v) = w P (&v) for all v e H, as before, so 

E P {w P (X)\(<P 0j ,X)\ 2 } = E P {w P (& jk X)\{<j> 0j ,e jk X)\ 2 } 

= E P {w P {X)\(e jk cf> Qj ,X)\ 2 } 

= E P {w P (X)\(4^,X)\ 2 }. 

Then A j = A fc, as claimed. 
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